Exercise 3 – Classification I

Introduction to Machine Learning

Exercise 1: Logistic vs Softmax Regression [only for lecture group A]

Binary logistic regression is a special case of multiclass logistic, or softmax, regression. The softmax function is the multiclass analogue of the logistic function, transforming the class scores \(\boldsymbol{\theta}_k^\top \mathbf{x}\) into values in the range [0, 1] that sum to one. The softmax function is defined as:

\[ \pi_k(\mathbf{x}| \boldsymbol{\theta}) = \frac{\exp(\boldsymbol{\theta}_k^\top \mathbf{x})}{\sum_{j=1}^{g} \exp(\boldsymbol{\theta}_j^\top \mathbf{x})}, \quad k \in \{1, \dots, g\} \]

Show that logistic and softmax regression are equivalent for \(g = 2\).

Learning goals

TBD

Solution

As we would expect, the two formulations are equivalent (up to reparameterization). In order to see this, consider the softmax function components for both classes:

\[ \pi_1(\mathbf{x}| \boldsymbol{\theta}) = \frac{\exp(\boldsymbol{\theta}_1^\top \mathbf{x})}{\exp(\boldsymbol{\theta}_1^\top \mathbf{x}) + \exp(\boldsymbol{\theta}_2^\top \mathbf{x})} \]

\[ \pi_2(\mathbf{x}| \boldsymbol{\theta}) = \frac{\exp(\boldsymbol{\theta}_2^\top \mathbf{x})}{\exp(\boldsymbol{\theta}_1^\top \mathbf{x}) + \exp(\boldsymbol{\theta}_2^\top \mathbf{x})} \]

Since we know that \(\pi_1(\mathbf{x}| \boldsymbol{\theta}) + \pi_2(\mathbf{x}| \boldsymbol{\theta}) = 1\), it is sufficient to compute one of the two scoring functions. Let’s pick \(\pi_1(\mathbf{x}| \boldsymbol{\theta})\) and relate it to the logistic function:

Dividing both numerator and denominator by \(\exp(\boldsymbol{\theta}_1^\top \mathbf{x})\) gives

\[ \pi_1(\mathbf{x}| \boldsymbol{\theta}) = \frac{1}{1 + \exp(\boldsymbol{\theta}_2^\top \mathbf{x} - \boldsymbol{\theta}_1^\top \mathbf{x})} = \frac{1}{1 + \exp(-\boldsymbol{\theta}^\top \mathbf{x})} \]

where \(\boldsymbol{\theta}:= \boldsymbol{\theta}_1 - \boldsymbol{\theta}_2\). Thus, we obtain the binary-case logistic function, reflecting that we only need one scoring function (and thus one set of parameters \(\boldsymbol{\theta}\) rather than two \(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2\)).
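The equivalence can also be verified numerically. The following is a short NumPy sketch (not part of the original exercise; all names and values are illustrative) comparing the softmax probability of class 1 with the logistic function evaluated at \(\boldsymbol{\theta} = \boldsymbol{\theta}_1 - \boldsymbol{\theta}_2\):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a vector of class scores."""
    z = scores - scores.max()
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # arbitrary feature vector
theta_1 = rng.normal(size=3)  # class-1 parameters
theta_2 = rng.normal(size=3)  # class-2 parameters

# Softmax probability of class 1 for g = 2 ...
pi_1 = softmax(np.array([theta_1 @ x, theta_2 @ x]))[0]
# ... equals the logistic function with theta := theta_1 - theta_2.
pi_logistic = sigmoid((theta_1 - theta_2) @ x)

print(np.isclose(pi_1, pi_logistic))  # True
```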

Exercise 2: Hyperplanes [only for lecture group B]

Learning goals

TBD

Linear classifiers like logistic regression learn a decision boundary that takes the form of a (linear) hyperplane. Hyperplanes are defined by equations \(\boldsymbol{\theta}^\top \mathbf{x}= b\) with coefficients \(\boldsymbol{\theta}\) and a scalar \(b \in \mathbb{R}\).

In order to see that such expressions actually describe hyperplanes, consider \(\boldsymbol{\theta}^\top \mathbf{x}= \theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0\). Sketch the hyperplanes given by the following coefficients and explain the difference between the parameterizations:

  • \(\theta_0 = 0, \theta_1 = \theta_2 = 1\)
  • \(\theta_0 = 1, \theta_1 = \theta_2 = 1\)
  • \(\theta_0 = 0, \theta_1 = 1, \theta_2 = 2\)
Solution

A hyperplane in 2D is just a line. We know that two points are sufficient to determine a line, so all we need to do is pick two points fulfilling the hyperplane equation.

  • \(\theta_0 = 0, \theta_1 = \theta_2 = 1\) \(\rightsquigarrow\) e.g., (0, 0) and (1, -1). Sketch it:

  • \(\theta_0 = 1, \theta_1 = \theta_2 = 1\) \(\rightsquigarrow\) e.g., (0, -1) and (1, -2). The nonzero \(\theta_0\) shifts the line parallel to itself, so it no longer passes through the origin:

  • \(\theta_0 = 0, \theta_1 = 1, \theta_2 = 2\) \(\rightsquigarrow\) e.g., (0, 0) and (1, -0.5). The change in \(\theta_2\) pivots the line around the intercept; the slope changes from \(-1\) to \(-\tfrac{1}{2}\):

We see that a hyperplane is fully determined by the points that lie directly on it, i.e., the points that fulfill the hyperplane equation.
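As a quick sanity check (a sketch added here, not part of the original solution), the following snippet verifies that the sample points listed above indeed satisfy their respective hyperplane equations:

```python
import numpy as np

def on_hyperplane(theta0, theta1, theta2, point, tol=1e-12):
    """Check whether (x1, x2) satisfies theta0 + theta1*x1 + theta2*x2 = 0."""
    x1, x2 = point
    return abs(theta0 + theta1 * x1 + theta2 * x2) < tol

# One (coefficients, sample points) pair per bullet above.
cases = [
    ((0, 1, 1), [(0, 0), (1, -1)]),
    ((1, 1, 1), [(0, -1), (1, -2)]),
    ((0, 1, 2), [(0, 0), (1, -0.5)]),
]

for coeffs, points in cases:
    assert all(on_hyperplane(*coeffs, p) for p in points)
print("all sample points lie on their hyperplanes")
```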

Exercise 3: Decision Boundaries & Thresholds in Logistic Regression

Learning goals

TBD

In logistic regression (binary case), we estimate the probability \(P(y = 1 | \mathbf{x}, \boldsymbol{\theta}) = \pi(\mathbf{x}| \boldsymbol{\theta})\). In order to decide about the class of an observation, we set \(\hat{y} = 1\) iff \(\pi(\mathbf{x}| \boldsymbol{\theta}) \geq \alpha\) for some \(\alpha \in (0, 1)\).
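This thresholding rule can be sketched in a few lines of NumPy (an illustration only; the parameter values are arbitrary and not taken from the exercise):

```python
import numpy as np

def predict(x, theta, alpha=0.5):
    """Predict y_hat = 1 iff pi(x | theta) >= alpha for a logistic model."""
    pi = 1.0 / (1.0 + np.exp(-(theta @ x)))  # logistic function of the score
    return int(pi >= alpha)

theta = np.array([0.5, -1.0])  # illustrative parameters
x = np.array([2.0, 0.5])       # theta @ x = 0.5, so pi = sigmoid(0.5) > 0.5

print(predict(x, theta, alpha=0.5))  # 1
print(predict(x, theta, alpha=0.9))  # 0, since pi is roughly 0.62 < 0.9
```

Raising \(\alpha\) makes the classifier more conservative about predicting class 1, without changing the estimated probabilities themselves.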

  1. Show that the decision boundary of the logistic classifier is a (linear) hyperplane. Hint: derive the value of \(\boldsymbol{\theta}^\top \mathbf{x}\) (depending on \(\alpha\)) beyond which you predict \(\hat{y} = 1\) rather than \(\hat{y} = 0\).

  2. Below you see the logistic function for a binary classification problem with two input features for different values of \(\boldsymbol{\theta}= (\theta_1, \theta_2)\) (plots 1-3) as well as of \(\alpha\) (plot 4). What can you deduce about the values of \(\theta_1\), \(\theta_2\), and \(\alpha\)? What are the implications for classification in the different scenarios?

  3. Derive the equation for the decision boundary hyperplane if we choose \(\alpha = 0.5\).

  4. Explain when it might be sensible to set \(\alpha\) to 0.5.

Solution